Abstract

We show that all non-negative submodular functions have high noise-stability. As a consequence,
we obtain a polynomial-time learning algorithm for this class with respect to any product distribution on
$\{-1, 1\}^n$ (for any constant accuracy parameter $\epsilon$). Our algorithm also succeeds in the agnostic setting.
Previous work on learning submodular functions required either query access or strong assumptions
about the types of submodular functions to be learned (and did not hold in the agnostic setting).

1 Introduction

A function $f : 2^{[n]} \to \mathbb{R}$ is submodular if

$$\forall S, T \subseteq [n] : f(S \cup T) + f(S \cap T) \le f(S) + f(T).$$

Submodular functions have been extensively studied in the context of combinatorial optimization [Edm71,
NWF78, FNW78, Lov83] where the functions under consideration (such as the cut function of a graph) are
submodular. An equivalent formulation of submodularity is that of decreasing marginal returns,

$$\forall S \subseteq T \subseteq [n],\ i \in [n] \setminus T : f(T \cup \{i\}) - f(T) \le f(S \cup \{i\}) - f(S),$$

and thus submodular functions are also a topic of study in economics and the algorithmic game theory
community [DNS06, MR07]. In most contexts, the submodular functions considered are non-negative
[DNS06, FMV07, MR07, Von09, OV11, BH11, GHRU11], and we will be focusing on non-negative
submodular functions as well.
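
The two characterizations of submodularity given above can be checked by brute force on a small ground set. The following is a minimal sketch, not part of the original development; the path graph, the function names, and the tolerance are illustrative assumptions.

```python
from itertools import combinations

def subsets(n):
    """Yield every subset of {0, ..., n-1} as a frozenset."""
    for k in range(n + 1):
        for combo in combinations(range(n), k):
            yield frozenset(combo)

def is_submodular(f, n, tol=1e-12):
    """Check f(S | T) + f(S & T) <= f(S) + f(T) for all S, T."""
    all_sets = list(subsets(n))
    return all(f(s | t) + f(s & t) <= f(s) + f(t) + tol
               for s in all_sets for t in all_sets)

def has_decreasing_marginals(f, n, tol=1e-12):
    """Check f(T + i) - f(T) <= f(S + i) - f(S) for all S <= T and i outside T."""
    all_sets = list(subsets(n))
    for s in all_sets:
        for t in all_sets:
            if not s <= t:
                continue
            for i in set(range(n)) - t:
                if f(t | {i}) - f(t) > f(s | {i}) - f(s) + tol:
                    return False
    return True

# Hypothetical example: the cut function of the path graph 0-1-2-3.
EDGES = [(0, 1), (1, 2), (2, 3)]

def cut(s):
    return sum(1 for u, v in EDGES if (u in s) != (v in s))

def square_size(s):
    # |S|^2 has increasing marginal returns, so it should fail both checks.
    return len(s) ** 2

print(is_submodular(cut, 4), has_decreasing_marginals(cut, 4))
print(is_submodular(square_size, 4), has_decreasing_marginals(square_size, 4))
```

The enumeration is exponential in $n$, so this is only meant as a sanity check on toy instances.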

The main contribution of this paper is a proof that non-negative submodular functions are noise stable.
Informally, a noise stable function $f$ is one whose value on a random input $x$ does not change much
if $x$ is subjected to a small, random perturbation. Noise stability is a fundamental topic in the analysis
of Boolean functions with applications in hardness of approximation, learning theory, social choice, and
pseudorandomness [KKL88, Has01, BKS99, O'D04, KOS04, MOO10].

In order to define noise stability, we first define a noise operator that acts on functions over $\{-1, 1\}^n$.

Definition 1 (Noise operators) For any product distribution $\Pi = \Pi_1 \times \Pi_2 \times \cdots \times \Pi_n$ over $\{-1, 1\}^n$, $\rho \in [0, 1]$,
$x \in \{-1, 1\}^n$, let the random variable $y$ drawn from the distribution $N_\rho(x)$ over $\{-1, 1\}^n$ have $y_i = x_i$ with
probability $\rho$ and be randomly drawn from $\Pi_i$ with probability $1 - \rho$. The noise operator $T_\rho$ on $f : \{-1, 1\}^n \to \mathbb{R}$
is defined by letting $T_\rho f : \{-1, 1\}^n \to \mathbb{R}$ be the function given by $T_\rho f(x) = E_{y \sim N_\rho(x)} f(y)$.

N.B.: For the uniform distribution, $y \sim N_\rho(x)$ has $y_i = x_i$ with probability $1/2 + \rho/2$, and $y_i = -x_i$ with
probability $1/2 - \rho/2$.
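
For small $n$, the noise operator of Definition 1 can be evaluated exactly by enumerating all $y \in \{-1, 1\}^n$. The sketch below is illustrative only; it assumes the product distribution is given as a list `p` with `p[i]` the probability that coordinate $i$ equals $+1$, and the test function `count_ones` is hypothetical.

```python
from itertools import product

def noise_operator(f, x, rho, p):
    """Exact T_rho f(x) = E_{y ~ N_rho(x)} f(y), where p[i] = Pr[Pi_i = +1]."""
    total = 0.0
    for y in product((-1, 1), repeat=len(x)):
        weight = 1.0
        for xi, yi, pi in zip(x, y, p):
            fresh = pi if yi == 1 else 1.0 - pi   # y_i redrawn from Pi_i
            kept = 1.0 if yi == xi else 0.0       # y_i copied from x_i
            weight *= rho * kept + (1.0 - rho) * fresh
        total += weight * f(y)
    return total

def count_ones(y):
    """Hypothetical test function: the number of +1 coordinates."""
    return sum(1 for yi in y if yi == 1)

x = (1, -1, 1)
uniform = [0.5, 0.5, 0.5]
print(noise_operator(count_ones, x, 1.0, uniform))  # rho = 1: no noise, equals f(x)
print(noise_operator(count_ones, x, 0.0, uniform))  # rho = 0: equals the mean of f
print(noise_operator(count_ones, x, 0.5, uniform))
```

Under the uniform distribution each coordinate of $y$ is $+1$ with probability $(1 + \rho x_i)/2$, so for this linear test function the third value is $0.75 + 0.25 + 0.75 = 1.75$.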

Now we can precisely define noise stability: 

Definition 2 (Noise stability) The noise stability of $f$ at noise rate $\rho$ is defined to be

$$S_\rho(f) = \langle f, T_\rho f \rangle = E_{x \sim \Pi}[f(x) T_\rho f(x)].$$

The precise statement of our main theorem is as follows (see Section 2 for definitions):

Theorem 3 Let $\Pi = \Pi_1 \times \Pi_2 \times \cdots \times \Pi_n$ be a product distribution over $\{-1, 1\}^n$ with minimum probability
$p_{\min}$ and let $f : \{-1, 1\}^n \to \mathbb{R}^+$ be a submodular function. Then for all $\rho \in [0, 1]$,

$$S_\rho(f) \ge (2\rho - 1 + 2p_{\min}(1 - \rho)) \|f\|_2^2.$$

N.B.: For the uniform distribution we get the bound $S_\rho(f) \ge \rho \|f\|_2^2$.
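
The bound of Theorem 3 can be checked numerically on a toy instance. The following is a sketch under stated assumptions: the function `capped_count` and the bias vector `p` are hypothetical, and the biases are chosen to be at most $1/2$ so that $\min_i p_i$ also equals $\min_i \min(p_i, 1 - p_i)$.

```python
from itertools import product

def prob(x, p):
    """Probability of x under the product distribution with p[i] = Pr[x_i = +1]."""
    w = 1.0
    for xi, pi in zip(x, p):
        w *= pi if xi == 1 else 1.0 - pi
    return w

def noise_operator(f, x, rho, p):
    """Exact T_rho f(x): each y_i equals x_i w.p. rho, else is redrawn from Pi_i."""
    total = 0.0
    for y in product((-1, 1), repeat=len(x)):
        w = 1.0
        for xi, yi, pi in zip(x, y, p):
            fresh = pi if yi == 1 else 1.0 - pi
            w *= rho * (yi == xi) + (1.0 - rho) * fresh
        total += w * f(y)
    return total

def noise_stability(f, rho, p):
    """S_rho(f) = E_x[f(x) * T_rho f(x)]."""
    return sum(prob(x, p) * f(x) * noise_operator(f, x, rho, p)
               for x in product((-1, 1), repeat=len(p)))

def squared_norm(f, p):
    """||f||_2^2 = E_x[f(x)^2]."""
    return sum(prob(x, p) * f(x) ** 2 for x in product((-1, 1), repeat=len(p)))

def capped_count(x):
    """Hypothetical submodular example: min(|X|, 2), a concave function of |X|."""
    return min(sum(1 for xi in x if xi == 1), 2)

p = [0.5, 0.4, 0.3, 0.2]   # biases at most 1/2 (an assumption of this sketch)
p_min = min(p)
for rho in (0.0, 0.25, 0.5, 0.75, 1.0):
    lhs = noise_stability(capped_count, rho, p)
    rhs = (2 * rho - 1 + 2 * p_min * (1 - rho)) * squared_norm(capped_count, p)
    print(f"rho={rho:.2f}  S_rho={lhs:.4f}  bound={rhs:.4f}  holds={lhs >= rhs - 1e-9}")
```

At $\rho = 1$ the two sides coincide, since $T_1 f = f$ and the coefficient in the bound equals $1$.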

Given the high noise-stability of submodular functions, we can apply known results from computational
learning theory to show that submodular functions are well-approximated by low-degree polynomials and
can be learned agnostically. Our main learning result is as follows:

Corollary 4 Let $C$ be the class of non-negative submodular functions with $\|f\|_2 = 1$ and let $D$ be any
distribution on $\{-1, 1\}^n \times \mathbb{R}$ such that the marginal distribution over $\{-1, 1\}^n$ is a product distribution. Then
there is a statistical query algorithm that outputs a hypothesis $h$ with probability $1 - \delta$ such that

$$E_{(x,y) \sim D}[|h(x) - y|] \le \mathrm{opt} + \epsilon,$$

in time $\mathrm{poly}(n^{O(1/\epsilon^2)}, \log(1/\delta))$.

Here opt is the $L_1$-error of the best fitting concept in the concept class. (See Section 4 for the precise
definition.) Note that the above algorithm will succeed given only statistical query access [Kea98] to the
underlying function to be learned. It can be shown that the $L_2$-norm of a submodular function is always
within a constant factor of its mean squared. Thus, the algorithm can estimate the $L_2$-norm of the submodular
function $f$ to very high accuracy using Chernoff-Hoeffding bounds and scale the function by its mean so
that its $L_2$-norm is 1.

1.1 Related Work 

Recently the study of learning submodular functions was initiated in two very different contexts. Gupta et al.
[GHRU11] gave an algorithm for learning bounded submodular functions that arose as a technical necessity
for differentially privately releasing the class of disjunctions. Their learning algorithm requires value query
access to the target function, but it works even when the value queries are answered with
additive error (value queries that are answered with additive error at most $\tau$ are said to be $\tau$-tolerant).



Theorem 5 ([GHRU11]) Let $\epsilon, \delta > 0$ and let $\Pi$ be any product distribution over $[n]$. There is a learning
algorithm that when given ($\epsilon/4$-tolerant) value query access to any submodular function $f : 2^{[n]} \to [0, 1]$,
outputs a hypothesis $h$ in time $n^{O(\log(1/\delta)/\epsilon^2)}$ such that

$$\Pr_{S \sim \Pi}[|f(S) - h(S)| \le \epsilon] \ge 1 - \delta.$$


The learning algorithm of Gupta et al. crucially relies on its query access to the submodular function in 
order to break the function down into Lipschitz continuous parts that are easier to learn. Compare this 
to Corollary 4, which has similar learning guarantees, but where the learner only has access to statistical 
queries and can learn in the agnostic model of learning. (See Section 4.) 

The other recent work [BH11] on learning submodular functions was motivated by bundle pricing and
used passive supervised learning as a model for learning consumer valuations of added options. In particular,
they have a polynomial-time algorithm that can learn (using random examples only) monotone,
non-negative, submodular functions within a multiplicative factor of $\sqrt{n}$ over arbitrary distributions. As our
machinery breaks down over non-product distributions, none of our results hold in this setting. For product
distributions, Balcan and Harvey gave the first $\mathrm{poly}(n, 1/\epsilon)$-time algorithm that can learn (using random
examples only) monotone, non-negative, Lipschitz submodular functions with minimum value $m$ within a
multiplicative factor of $O(\log(1/\epsilon)/m)$.

1.2 Applications to Differential Privacy 

We discuss some applications to differential privacy in Section 5. In particular, we obtain a simple proof of
the recent result of Gupta et al. [GHRU11] on releasing disjunctions, with improved parameters.

2 Preliminaries 

Throughout, we will identify sets $S \subseteq [n]$ with their indicator vectors $1(S) \in \{-1, 1\}^n$ where $1(S)_i = 1$ if
$i \in S$ and $1(S)_i = -1$ if $i \notin S$ (as opposed to the usual $\{0, 1\}$-indicator vectors). For any distribution $\Pi$
over $\{-1, 1\}^n$, we define the inner product on functions $f, g : \{-1, 1\}^n \to \mathbb{R}$ by $\langle f, g \rangle = E_{x \sim \Pi}[f(x) g(x)]$ and the
$L_2$-norm of a function $f$ as $\|f\|_2 = \sqrt{\langle f, f \rangle} = \sqrt{E_{x \sim \Pi}[f(x)^2]}$.

Definition 6 (Minimum probability) Let $\Pi = \Pi_1 \times \Pi_2 \times \cdots \times \Pi_n$ be a product distribution over $\{-1, 1\}^n$,
and let $p_i := \Pr_{x_i \sim \Pi_i}[x_i = 1]$; then $p_{\min} = \min_{i \in [n]} \{p_i\}$.

3 Submodular Functions are Noise Stable 

We start by showing that submodular functions are noise stable under the uniform distribution. This serves
as a warm-up, since the notation is less cumbersome in this setting. In Section 3.2 we will prove Theorem 3 in the
general setting of arbitrary product distributions.

3.1 Uniform Distribution 

For the rest of Section 3.1 we will assume that the distribution over inputs is uniform. 
Let the Fourier expansion of $f$ be given by $\sum_{S \subseteq [n]} \hat{f}(S) \chi_S$; it can be shown that

$$T_\rho f(x) = \sum_{S \subseteq [n]} \rho^{|S|} \hat{f}(S) \chi_S(x),$$

and thus

$$S_\rho(f) = \sum_{S \subseteq [n]} \rho^{|S|} \hat{f}(S)^2.$$
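
The Fourier formula for noise stability under the uniform distribution can be compared against the definition $E_x[f(x) T_\rho f(x)]$ by direct enumeration. This is a small illustrative sketch; the cut function of a 4-cycle is a hypothetical example, and any real-valued function on $\{-1, 1\}^n$ could be substituted.

```python
from itertools import combinations, product
from math import prod

def fourier_coefficient(f, subset, n):
    """hat f(S) = E_x[f(x) * chi_S(x)] under the uniform distribution."""
    return sum(f(x) * prod(x[i] for i in subset)
               for x in product((-1, 1), repeat=n)) / 2 ** n

def stability_from_spectrum(f, rho, n):
    """Sum over all S of rho^|S| * hat f(S)^2."""
    return sum(rho ** k * fourier_coefficient(f, s, n) ** 2
               for k in range(n + 1) for s in combinations(range(n), k))

def stability_direct(f, rho, n):
    """E_x[f(x) * T_rho f(x)], where y_i = x_i with probability (1 + rho) / 2."""
    total = 0.0
    for x in product((-1, 1), repeat=n):
        t_rho = sum(f(y) * prod((1 + rho * xi * yi) / 2 for xi, yi in zip(x, y))
                    for y in product((-1, 1), repeat=n))
        total += f(x) * t_rho / 2 ** n
    return total

EDGES = [(0, 1), (1, 2), (2, 3), (3, 0)]   # hypothetical 4-cycle

def cut(x):
    """Number of edges whose endpoints receive different signs."""
    return sum(1 for u, v in EDGES if x[u] != x[v])

rho = 0.3
print(stability_from_spectrum(cut, rho, 4), stability_direct(cut, rho, 4))
```

The two printed values should agree up to floating-point error.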
The following lemma is our key observation.

Lemma 7 Let $f : \{-1, 1\}^n \to \mathbb{R}$ be a submodular function. Then for all $x \in \{-1, 1\}^n$, $\rho \in [0, 1]$,

$$T_\rho f(x) \ge \rho f(x) + ((1 - \rho)/2)(f(-1^n) + f(1^n)).$$

For $f : \{-1, 1\}^n \to \mathbb{R}^+$, $T_\rho f(x) \ge \rho f(x)$.

Proof. We will be viewing the domain of $f$ as $2^{[n]}$, and the input $x \in \{-1, 1\}^n$ as $X \in 2^{[n]}$ such that $1(X) = x$.
For a fixed $x \in \{-1, 1\}^n$, let $\pi : [n] \to [n]$ be a permutation such that $x_{\pi(1)} \ge \cdots \ge x_{\pi(n)}$, and then define
$X_j = \{\pi(1), \ldots, \pi(j)\}$. (N.B.: $X_0 = \emptyset$ and $X_n = [n]$.) Finally, we define $x_{\pi(0)} = 1$ and $x_{\pi(n+1)} = -1$. Note that
there is only one value $J \in \{0, \ldots, n\}$ for which $x_{\pi(J)} \ne x_{\pi(J+1)}$, and for this value $X_J = X$.



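$$\begin{aligned}
T_\rho f(x) = E_{Y \sim N_\rho(X)} f(Y) &= f(X_0) + E_{Y \sim N_\rho(X)} \sum_{j=1}^{n} \left[ f(Y \cap X_j) - f(Y \cap X_{j-1}) \right] \\
&\ge f(X_0) + E_{Y \sim N_\rho(X)} \sum_{j=1}^{n} \left[ f((Y \cap \{\pi(j)\}) \cup X_{j-1}) - f(X_{j-1}) \right] \\
&= f(X_0) + \sum_{j=1}^{n} \frac{1 + \rho x_{\pi(j)}}{2} \left( f(X_j) - f(X_{j-1}) \right) \\
&= \sum_{j=0}^{n} \frac{\rho}{2} \left( x_{\pi(j)} - x_{\pi(j+1)} \right) f(X_j) + \frac{1 - \rho}{2} \left( f(X_0) + f(X_n) \right) \\
&= \rho f(X) + \frac{1 - \rho}{2} \left( f(X_0) + f(X_n) \right).
\end{aligned}$$

The inequality is due to the decreasing marginal returns characterization of the submodularity of $f$. The
equality after that comes from moving the expectation inside and observing that each summand is non-zero
only if $\pi(j) \in Y$. This happens with probability $(1 + \rho)/2$ when $x_{\pi(j)} = 1$ and with probability $(1 - \rho)/2$ when
$x_{\pi(j)} = -1$. The next equality collects the coefficient of each $f(X_j)$, and the last one uses the fact that
$x_{\pi(j)} - x_{\pi(j+1)}$ is non-zero (and equal to 2) only for $j = J$.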
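
The pointwise inequality of Lemma 7 can be tested exhaustively for small $n$ under the uniform distribution. The sketch below is an illustration rather than part of the proof; the 4-cycle cut function and the function `squared_count` (which is not submodular) are hypothetical examples chosen to show the bound holding in one case and failing in the other.

```python
from itertools import product
from math import prod

def noise_operator_uniform(f, x, rho):
    """Exact T_rho f(x) under the uniform distribution."""
    return sum(f(y) * prod((1 + rho * xi * yi) / 2 for xi, yi in zip(x, y))
               for y in product((-1, 1), repeat=len(x)))

def lemma7_gap(f, x, rho):
    """T_rho f(x) minus the bound rho*f(x) + ((1 - rho)/2)*(f(-1^n) + f(1^n))."""
    n = len(x)
    bound = rho * f(x) + (1 - rho) / 2 * (f((-1,) * n) + f((1,) * n))
    return noise_operator_uniform(f, x, rho) - bound

def worst_gap(f, n, rho):
    """Smallest gap over all inputs; non-negative when the bound holds everywhere."""
    return min(lemma7_gap(f, x, rho) for x in product((-1, 1), repeat=n))

EDGES = [(0, 1), (1, 2), (2, 3), (3, 0)]   # hypothetical 4-cycle

def cut(x):
    return sum(1 for u, v in EDGES if x[u] != x[v])

def squared_count(x):
    # |X|^2 has increasing marginal returns, so the lemma does not apply to it.
    return sum(1 for xi in x if xi == 1) ** 2

print(worst_gap(cut, 4, 0.5) >= -1e-9)        # bound holds for the cut function
print(worst_gap(squared_count, 4, 0.5) < 0)   # bound fails without submodularity
```

For `squared_count` the failure already occurs at $x = -1^n$: with $\rho = 1/2$ each coordinate of $Y$ is present with probability $1/4$, giving $T_\rho f(x) = 1.75$ against a bound of $4$.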

Remark. The proof technique is not new (for instance it was used by Madiman and Tetali [MT10] to show a 
large class of Shannon-type inequalities for the joint entropy function). In fact, it can be viewed as a special 
case of the "Threshold Lemma" [Von09]. However, to the best of our knowledge the statement of Lemma 7 
has never been expressed using the language of noise operators. 

Corollary 8 Let $f : \{-1, 1\}^n \to \mathbb{R}^+$ be a submodular function. Then for all $\rho \in [0, 1]$,

$$S_\rho(f) \ge \rho \|f\|_2^2.$$



3.2 Product Distributions 

For the rest of Section 3.2 we will assume that the distribution is a product distribution $\Pi = \Pi_1 \times \Pi_2 \times \cdots \times \Pi_n$
on $\{-1, 1\}^n$ with minimum probability $p_{\min}$.

Lemma 9 Let $\Pi = \Pi_1 \times \Pi_2 \times \cdots \times \Pi_n$ be a product distribution over $\{-1, 1\}^n$ with minimum probability $p_{\min}$
and let $f : \{-1, 1\}^n \to \mathbb{R}^+$ be a submodular function. Then for all $x \in \{-1, 1\}^n$, $\rho \in [0, 1]$,

$$T_\rho f(x) \ge ((2\rho - 1) + 2p_{\min}(1 - \rho)) f(x).$$

Proof. As in the proof of Lemma 7, we will be viewing the domain of $f$ as $2^{[n]}$, and the input $x \in \{-1, 1\}^n$
as $X \in 2^{[n]}$ such that $1(X) = x$. For a fixed $x \in \{-1, 1\}^n$, let $\pi : [n] \to [n]$ be a permutation such that
$x_{\pi(1)} \ge \cdots \ge x_{\pi(n)}$, and then define $X_j = \{\pi(1), \ldots, \pi(j)\}$. (N.B.: $X_0 = \emptyset$ and $X_n = [n]$.)



$$\begin{aligned}
E_{Y \sim N_\rho(X)} f(Y) &= f(X_0) + E_{Y \sim N_\rho(X)} \sum_{j=1}^{n} \left[ f(Y \cap X_j) - f(Y \cap X_{j-1}) \right] \\
&\ge f(X_0) + E_{Y \sim N_\rho(X)} \sum_{j=1}^{n} \left[ f((Y \cap \{\pi(j)\}) \cup X_{j-1}) - f(X_{j-1}) \right] \\
&= f(X_0) + \sum_{j=1}^{n} \Pr[\pi(j) \in Y] \left( f(X_j) - f(X_{j-1}) \right) \\
&= \left( 1 - \Pr[\pi(1) \in Y] \right) f(X_0) + \sum_{j=1}^{n-1} \left( \Pr[\pi(j) \in Y] - \Pr[\pi(j+1) \in Y] \right) f(X_j) + \Pr[\pi(n) \in Y] f(X_n) \\
&\ge (2\rho - 1 + 2p_{\min}(1 - \rho)) f(X).
\end{aligned}$$

The first inequality comes from using submodularity in each term of the summation. The equality after that
comes from moving the expectation inside and observing that each summand is non-zero only if $\pi(j) \in Y$.
This happens with probability $\rho + (1 - \rho) p_{\pi(j)}$ when $x_{\pi(j)} = 1$ and with probability $(1 - \rho)(1 - p_{\pi(j)})$ when
$x_{\pi(j)} = -1$. The next equality collects the coefficient of each $f(X_j)$. Finally, the last line follows by the
non-negativity of $f$ and observing that for any values of $x_{\pi(1)}$ and $x_{\pi(n)}$, the coefficients of $f(\emptyset)$ and
$f([n])$ are non-negative and the coefficient of $f(X)$ is at least $(2\rho - 1 + 2p_{\min}(1 - \rho))$.

As with Fourier analysis over the uniform distribution, it can be easily verified that

$$S_\rho(f) = \langle f, T_\rho f \rangle = \sum_{S \subseteq [n]} \rho^{|S|} \hat{f}(S)^2$$

over any product distribution $\Pi$, where the Fourier coefficients are now defined with respect to the
Gram-Schmidt orthonormalization of the $\chi$ basis with respect to the $\Pi$-norm [Bah61, FJS91]. Thus, once
again we get a lower bound on the noise stability of submodular functions as an immediate consequence of
Lemma 9.

Theorem 10 (Theorem 3 Restated) Let $\Pi = \Pi_1 \times \Pi_2 \times \cdots \times \Pi_n$ be a product distribution over $\{-1, 1\}^n$ with
minimum probability $p_{\min}$ and let $f : \{-1, 1\}^n \to \mathbb{R}^+$ be a submodular function. Then for all $\rho \in [0, 1]$,

$$S_\rho(f) \ge (2\rho - 1 + 2p_{\min}(1 - \rho)) \|f\|_2^2.$$



4 Learning 

In the agnostic learning framework [KSS94], the learner receives labelled examples (x, y) drawn from a 
fixed distribution over example-label pairs. 
Definition 11 (Agnostic Learning) Let $D$ be any distribution on $\{-1, 1\}^n \times \mathbb{R}$ such that the marginal
distribution over $\{-1, 1\}^n$ is a product distribution $\Pi$. Define

$$\mathrm{opt} = \min_{f \in C} E_{(x,y) \sim D}[|f(x) - y|].$$

That is, opt is the error of the best fitting $L_1$-approximation in $C$ with respect to $D$.

We say that an algorithm $A$ agnostically learns a concept class $C$ over $\Pi$ if the following holds for any
$D$ with marginal $\Pi$: if $A$ is given random examples drawn from $D$, then with high probability $A$ outputs a
hypothesis $h$ such that $E_{(x,y) \sim D}[|h(x) - y|] \le \mathrm{opt} + \epsilon$.

The following lemma, considered folklore (see [KOS04]), shows that noise stable functions are
well-approximated by low-degree polynomials.

Lemma 12 Let $\Pi = \Pi_1 \times \Pi_2 \times \cdots \times \Pi_n$ be a product distribution over $\{-1, 1\}^n$, and let $f : \{-1, 1\}^n \to \mathbb{R}$ be a
function such that $\|f\|_2 = 1$ and $S_\rho(f) \ge 1 - 2\gamma$. Then there exists a multilinear polynomial $p : \{-1, 1\}^n \to \mathbb{R}$
of degree $2/(1 - \rho)$ such that



The "$L_1$ Polynomial Regression Algorithm" due to Kalai et al. [KKMS08] shows that one can
agnostically learn low-degree polynomials.

Theorem 13 ([KKMS08]) Suppose $E_{x \sim D_X}[(f - p)^2] < \epsilon^2$ for some degree $d$ polynomial $p$, some distribution
$D$ on $X \times \mathbb{R}$ where the marginal $D_X$ is a product distribution on $\{-1, 1\}^n$, and any $f$ in the concept class $C$.
Then, with probability $1 - \delta$, the $L_1$ Polynomial Regression Algorithm outputs a hypothesis $h$ such that
$E_{(x,y) \sim D}[|h(x) - y|] \le \mathrm{opt} + \epsilon$ in time $\mathrm{poly}(n^d/\epsilon, \log(1/\delta))$.
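
The $n^d$ factor in the running time above reflects the number of multilinear monomials of degree at most $d$ over which a degree-$d$ polynomial is fitted. The sketch below only counts and evaluates these monomials; it is a rough accounting under that assumption and does not implement the regression of [KKMS08]. The helper names are hypothetical.

```python
from itertools import combinations
from math import comb, prod

def num_monomials(n, d):
    """Number of multilinear monomials chi_S over n variables with |S| <= d."""
    return sum(comb(n, k) for k in range(d + 1))

def monomial_features(x, d):
    """Values of all multilinear monomials chi_S(x) = prod_{i in S} x_i with |S| <= d."""
    return [prod(x[i] for i in s)
            for k in range(d + 1) for s in combinations(range(len(x)), k)]

print(num_monomials(10, 2))                        # 1 + 10 + 45
print(len(monomial_features((1, -1, 1, -1), 2)))   # 1 + 4 + 6
print(num_monomials(20, 3) <= (20 + 1) ** 3)       # at most (n + 1)^d features
```

A degree-$d$ hypothesis is then a linear function of this feature vector, so the size of the fitting problem grows as $n^{O(d)}$.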

Corollary 14 Let $C$ be the class of non-negative submodular functions with $\|f\|_2 = 1$ and let $D$ be any
distribution on $\{-1, 1\}^n \times \mathbb{R}$ such that the marginal distribution over $\{-1, 1\}^n$ is a product distribution. Then
for all $f \in C$, the $L_1$ Polynomial Regression Algorithm outputs a hypothesis $h$ with probability $1 - \delta$ such
that

$$E_{(x,y) \sim D}[|h(x) - y|] \le \mathrm{opt} + \epsilon,$$

given random examples in time $\mathrm{poly}(n^{O(1/\epsilon^2)}/\epsilon, \log(1/\delta))$.

We note that the $L_1$ Polynomial Regression Algorithm can be implemented as a statistical query
algorithm [Kal11]. (N.B.: The access offered to the learning algorithm by the statistical query model is much
weaker than that offered by random examples or the tolerant value query model. The tolerant value query
model allows arbitrary value queries that get answered with some noise, whereas the statistical query model
requires the queries to be of the form $g : \{-1, 1\}^n \times \mathbb{R} \to \mathbb{R}$ where $g$ is computable by a $\mathrm{poly}(n, 1/\epsilon)$-size
circuit, and the answer is $E_{(x,y) \sim D}[g(x, y)]$ with some noise.)

Corollary 15 (Corollary 4 Restated) Let $C$ be the class of non-negative submodular functions with $\|f\|_2 = 1$
and let $D$ be any distribution on $\{-1, 1\}^n \times \mathbb{R}$ such that the marginal distribution over $\{-1, 1\}^n$ is a product
distribution. Then for all $f \in C$, there is a statistical query algorithm that outputs a hypothesis $h$ with
probability $1 - \delta$ such that

$$E_{(x,y) \sim D}[|h(x) - y|] \le \mathrm{opt} + \epsilon,$$

in time $\mathrm{poly}(n^{O(1/\epsilon^2)}, \log(1/\delta))$.
5 Private Query Release and Low-Degree Polynomials

In this section, we make a simple observation connecting approximability by low-degree polynomials with
private query release.

In the context of differential privacy, we will call $D \subseteq X$ a database, and two databases $D, D' \subseteq X$ are
adjacent if one can be obtained from the other by adding a single item.

Definition 16 (Differential privacy [DMNS06]) An algorithm $A : X^* \to R$ is $\epsilon$-differentially private if for
all $Q \subseteq R$ and every pair of adjacent databases $D, D'$, we have $\Pr[A(D) \in Q] \le e^{\epsilon} \Pr[A(D') \in Q]$.

A counting query over a database $D$ is just the average value of a query over each entry in the database.

Definition 17 (Counting Query Function) Let $c : X \to \mathbb{R}$ be a real-valued query function. For a fixed
$r \in X$, let $q_r(c) := c(r)$. For a class of queries $C$ and a fixed database $D \subseteq X$, the counting query function
$CQ_D : C \to \mathbb{R}$ is the function defined by $CQ_D(c) := \frac{1}{|D|} \sum_{r \in D} q_r(c) = \frac{1}{|D|} \sum_{r \in D} c(r)$.
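
As an illustration of Definition 17, the sketch below evaluates counting queries for monotone disjunctions over a small database of bit vectors. The representation (records as tuples in $\{0, 1\}^d$, a disjunction as a set of variable indices) and the database itself are assumptions made for this example.

```python
def disjunction(variables):
    """Monotone disjunction over the given variable indices, as a query c: X -> {0, 1}."""
    def c(record):
        return 1 if any(record[i] for i in variables) else 0
    return c

def counting_query(database, c):
    """CQ_D(c): the average of c(r) over the records r in D."""
    return sum(c(r) for r in database) / len(database)

D = [(1, 0, 0), (0, 1, 0), (0, 0, 0), (1, 1, 0)]   # hypothetical database
print(counting_query(D, disjunction({0})))      # 0.5
print(counting_query(D, disjunction({0, 1})))   # 0.75
print(counting_query(D, disjunction({2})))      # 0.0
```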

The objective of a counting query releasing algorithm is to release a data structure $H$ whose answers on
queries $c \in C$ are close to those of the counting query over the original database $D$.

Definition 18 (Counting query release [GHRU11]) Let $C$ be a class of queries $c$ from $X \to \mathbb{R}$, and let $\Pi$
be a distribution on $C$. We say that an algorithm $A$ $(\alpha, \beta)$-releases $C$ over a database $D$ of size $n$, if for
$H = A(D)$,

$$\Pr_{c \sim \Pi}[|CQ_D(c) - H(c)| \le \alpha] \ge 1 - \beta.$$

The following proposition is implicit in [GHRU11] using results of [BDMN05] and [KLN+08].

Proposition 19 For a given concept class $C$ with distribution $\Pi$, if there is a query learning algorithm
for the concept class $\{CQ_D : D \subseteq X\}$ using $q$ $\tau$-tolerant value queries that outputs a hypothesis $H$ s.t.
$\Pr_{c \sim \Pi}[|CQ_D(c) - H(c)| \le \alpha] \ge 1 - \beta$, then there is an $\epsilon$-differentially private algorithm that $(\alpha, \beta)$-releases
$C$ for any database of size $|D| \ge q(\log q + \log(1/\delta))/\epsilon\tau$.

For instance, Gupta et al. [GHRU11] show that for $C$, the class of disjunctions, every function in the class
$\{CQ_D : D \subseteq X\}$ is submodular. Thus, their tolerant value query learning algorithm for submodular functions leads
to a private counting query release algorithm.
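
The submodularity of counting queries for disjunctions can be seen on a toy example. The sketch below assumes monotone disjunctions identified with the set of variables they contain, which is one reading of the setting in [GHRU11]; under that assumption $|D| \cdot CQ_D$ counts the records covered by the chosen variables. The database is hypothetical.

```python
from itertools import combinations

def satisfied_count(database, variables):
    """Number of records in D satisfied by the monotone disjunction over `variables`."""
    return sum(1 for r in database if any(r[i] for i in variables))

def all_subsets(d):
    """All subsets of {0, ..., d-1} as frozensets."""
    return [frozenset(s) for k in range(d + 1) for s in combinations(range(d), k)]

def counting_query_is_submodular(database, d):
    """Check g(S | T) + g(S & T) <= g(S) + g(T), where g(S) = |D| * CQ_D(S)."""
    def g(s):
        return satisfied_count(database, s)
    sets = all_subsets(d)
    return all(g(s | t) + g(s & t) <= g(s) + g(t) for s in sets for t in sets)

D = [(1, 0, 0), (0, 1, 0), (0, 0, 0), (1, 1, 0), (0, 1, 1)]   # hypothetical database
print(counting_query_is_submodular(D, 3))
```

Integer counts are used instead of averages so that the comparison is exact; dividing by $|D|$ does not affect the inequality.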

We make the following observation. For a given concept class $C$ with distribution $\Pi$, if for every $r \in X$,
$q_r$ is well-approximated by a low-degree polynomial with respect to $\Pi$, then $CQ_D$ is also well-approximated
by a low-degree polynomial with respect to $\Pi$. As statistical queries are strictly weaker than tolerant value
queries, the $L_1$ Polynomial Regression Algorithm satisfies the requirements of Proposition 19, and we have
a private counting query release algorithm for $C$. We note that it is easy to see that an $O(\log(1/\alpha))$-degree
polynomial can $L_1$-approximate $q_r$ to within $\alpha$, when $C$ is the class of disjunctions, and $\Pi$ is the uniform
distribution. (If $|r| = O(\log(1/\alpha))$, an $O(\log(1/\alpha))$-degree polynomial can interpolate the function exactly.
Otherwise, the constant 1 function is within $\alpha$ of $q_r$.) Thus, we are able to retrieve the result of Gupta et
al. [GHRU11] on releasing disjunctions easily with an improved running time of $|X|^{O(\log(1/\alpha))}$ as opposed
to $|X|^{O(1/\alpha^2)}$.
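
The parenthetical claim about the constant 1 function can be made concrete. The sketch below assumes that $\Pi$ is uniform over monotone disjunctions on $d$ variables (each variable included independently with probability $1/2$) and that $|r|$ denotes the number of ones in the record $r$; under these assumptions the $L_1$ distance between $q_r$ and the constant 1 function is $2^{-|r|}$, which is at most $\alpha$ once $|r| \ge \log_2(1/\alpha)$.

```python
from itertools import product

def l1_error_of_constant_one(r):
    """E_c |1 - q_r(c)| for c uniform over monotone disjunctions on len(r) variables."""
    d = len(r)
    misses = 0
    for c in product((0, 1), repeat=d):          # c[i] = 1 iff variable i is in the disjunction
        value = 1 if any(ci and ri for ci, ri in zip(c, r)) else 0
        misses += abs(1 - value)
    return misses / 2 ** d

r = (1, 1, 1, 0, 0)                              # hypothetical record with |r| = 3
print(l1_error_of_constant_one(r), 2 ** -sum(r))
```

The error is non-zero only for disjunctions that avoid every variable set in $r$, which is a $2^{-|r|}$ fraction of all disjunctions.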